
Support workflow terminate rollback#14465

Open
vaishnav-mk wants to merge 10 commits into main from vaish/terminate-rollback

@vaishnav-mk

@vaishnav-mk vaishnav-mk commented Jun 29, 2026

Contributor

Adds terminate rollback support across Workflows local tooling and the remote Wrangler terminate command.

This wires rollback: true through the terminate path:

  • @cloudflare/workflows-shared binding and local engine
  • Miniflare wrapped Workflows binding
  • Local Explorer API/OpenAPI schema
  • Wrangler workflows instances terminate --rollback
  • Wrangler workflows instances terminate --local --rollback

The production Workflows API already accepts terminate rollback; this PR adds the SDK/Wrangler client surfaces and fixes local rollback recovery so rollback can run after local engine restart/eviction.

Rollback is only valid for terminate and is only serialized when explicitly set to true.
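
The serialization rule above can be sketched as follows. This is a minimal illustration rather than the code in this PR: `buildStatusBody`, `InstanceAction`, and `StatusBody` are hypothetical names, and the `{ status: ... }` body shape is an assumption.

```typescript
// Hypothetical shapes for illustration; not the actual Wrangler types.
type InstanceAction = "pause" | "resume" | "restart" | "terminate";

interface StatusBody {
	status: InstanceAction;
	rollback?: true;
}

// Build the request body for a status change.
// `rollback` is rejected for non-terminate actions and is serialized
// only when it is explicitly true.
function buildStatusBody(
	action: InstanceAction,
	options: { rollback?: boolean } = {}
): StatusBody {
	if (options.rollback === true && action !== "terminate") {
		throw new Error("rollback is only valid for terminate");
	}
	const body: StatusBody = { status: action };
	if (options.rollback === true) {
		body.rollback = true;
	}
	return body;
}
```

With this shape, `rollback: false` and an omitted option produce the same body, so a request that does not ask for rollback carries no `rollback` field.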


A picture of a cute animal (not mandatory, but encouraged)

@changeset-bot

changeset-bot Bot commented Jun 29, 2026


🦋 Changeset detected

Latest commit: ce29ed6

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 7 packages
Name Type
@cloudflare/workflows-shared Minor
wrangler Minor
miniflare Minor
@cloudflare/vitest-pool-workers Patch
@cloudflare/vite-plugin Patch
@cloudflare/deploy-helpers Patch
@cloudflare/pages-shared Patch


@github-actions

github-actions Bot commented Jun 29, 2026

Contributor

✅ All changesets look good

@vaishnav-mk vaishnav-mk force-pushed the vaish/terminate-rollback branch 2 times, most recently from 4ccbf75 to 01fd44d Compare June 29, 2026 06:36
@pkg-pr-new

pkg-pr-new Bot commented Jun 29, 2026

@cloudflare/autoconfig

npm i https://pkg.pr.new/@cloudflare/autoconfig@14465

create-cloudflare

npm i https://pkg.pr.new/create-cloudflare@14465

@cloudflare/deploy-helpers

npm i https://pkg.pr.new/@cloudflare/deploy-helpers@14465

@cloudflare/kv-asset-handler

npm i https://pkg.pr.new/@cloudflare/kv-asset-handler@14465

miniflare

npm i https://pkg.pr.new/miniflare@14465

@cloudflare/pages-shared

npm i https://pkg.pr.new/@cloudflare/pages-shared@14465

@cloudflare/unenv-preset

npm i https://pkg.pr.new/@cloudflare/unenv-preset@14465

@cloudflare/vite-plugin

npm i https://pkg.pr.new/@cloudflare/vite-plugin@14465

@cloudflare/vitest-pool-workers

npm i https://pkg.pr.new/@cloudflare/vitest-pool-workers@14465

@cloudflare/workers-auth

npm i https://pkg.pr.new/@cloudflare/workers-auth@14465

@cloudflare/workers-editor-shared

npm i https://pkg.pr.new/@cloudflare/workers-editor-shared@14465

@cloudflare/workers-utils

npm i https://pkg.pr.new/@cloudflare/workers-utils@14465

wrangler

npm i https://pkg.pr.new/wrangler@14465

commit: ce29ed6

@vaishnav-mk vaishnav-mk force-pushed the vaish/terminate-rollback branch 5 times, most recently from f9b1b1a to 9ed9edf Compare June 29, 2026 06:59
@ask-bonk

ask-bonk Bot commented Jun 29, 2026

Contributor

@vaishnav-mk Bonk workflow was cancelled.

View workflow run · To retry, trigger Bonk again.

@vaishnav-mk vaishnav-mk force-pushed the vaish/terminate-rollback branch 3 times, most recently from 50bf663 to 2c0a39a Compare June 29, 2026 07:09
Comment thread packages/miniflare/src/workers/local-explorer/openapi.local.json Outdated
Comment thread packages/miniflare/src/workers/workflows/wrapped-binding.worker.ts Outdated
Comment thread packages/workflows-shared/src/binding.ts Outdated
@vaishnav-mk vaishnav-mk force-pushed the vaish/terminate-rollback branch 2 times, most recently from 4bafb21 to ef038d9 Compare June 29, 2026 10:23
@vaishnav-mk vaishnav-mk marked this pull request as ready for review June 29, 2026 10:55
@workers-devprod workers-devprod requested review from a team and james-elicx and removed request for a team June 29, 2026 10:55
@workers-devprod

workers-devprod commented Jun 29, 2026

Contributor

Codeowners approval required for this PR:

  • @cloudflare/wrangler
  • ✅ @cloudflare/workflows
Show detailed file reviewers
  • .changeset/workflows-terminate-rollback.md: [@cloudflare/wrangler]
  • packages/miniflare/scripts/openapi-filter-config.ts: [@cloudflare/wrangler]
  • packages/miniflare/src/workers/local-explorer/generated/types.gen.ts: [@cloudflare/wrangler]
  • packages/miniflare/src/workers/local-explorer/generated/zod.gen.ts: [@cloudflare/wrangler]
  • packages/miniflare/src/workers/local-explorer/openapi.local.json: [@cloudflare/wrangler]
  • packages/miniflare/src/workers/local-explorer/resources/workflows.ts: [@cloudflare/wrangler]
  • packages/miniflare/src/workers/workflows/wrapped-binding.worker.ts: [@cloudflare/wrangler]
  • packages/workflows-shared/src/binding.ts: [@cloudflare/wrangler]
  • packages/workflows-shared/src/context.ts: [@cloudflare/wrangler]
  • packages/workflows-shared/src/engine.ts: [@cloudflare/wrangler]
  • packages/workflows-shared/src/lib/rollback.ts: [@cloudflare/wrangler]
  • packages/workflows-shared/tests/engine.test.ts: [@cloudflare/wrangler]
  • packages/wrangler/src/tests/workflows.test.ts: [@cloudflare/wrangler]
  • packages/wrangler/src/workflows/commands/instances/terminate.ts: [@cloudflare/wrangler]
  • packages/wrangler/src/workflows/local.ts: [@cloudflare/wrangler]
  • packages/wrangler/src/workflows/utils.ts: [@cloudflare/wrangler]

devin-ai-integration[bot]

This comment was marked as resolved.

@pombosilva pombosilva self-requested a review June 29, 2026 11:29
@vaishnav-mk vaishnav-mk force-pushed the vaish/terminate-rollback branch from ef038d9 to b1be7f1 Compare June 30, 2026 11:18
@vaishnav-mk

Contributor Author

/bonk review

@ask-bonk ask-bonk Bot left a comment

Contributor


Review: Support workflow terminate rollback

Overall this is a well-structured change. The replay-to-rebuild-registry approach correctly addresses the "empty registry after DO restart/pause" limitation that Devin originally flagged, and the persisted hasRollback metadata is a clean way to know which cached steps to replay. Tests cover the live-registry, replayed-registry, and paused-registry cases. A few things worth addressing before merge.

Blocking / needs discussion

1. The remote (production) API path also sends rollback: true — is this intended/supported?

packages/wrangler/src/workflows/utils.ts:131 adds rollback: true to the body of the remote PATCH /accounts/.../instances/:id/status call. The PR description scopes this to "local Workflows tooling," and the engine implementation lives entirely in @cloudflare/workflows-shared (local). If the production Workflows REST API doesn't yet accept a rollback field, this either silently no-ops or could be rejected — yet the CLI unconditionally prints 🥷 The instance "..." was terminated successfully (terminate.ts:65). If the remote API isn't ready, consider gating --rollback to --local only (and erroring otherwise), so users don't get a false success message for a rollback that never ran. If the remote API is ready, the PR description should say so.

Non-blocking

2. No user-facing signal when rollback runs vs. is skipped.

This was Devin's original concern. The replay path mitigates the empty-registry case, but if getEligibleRollbackSteps() returns nothing (no eligible steps, or malformed metadata), rollbacks are silently skipped while the CLI still reports plain success. Consider surfacing in the success log whether rollbacks executed (e.g. count of rollback steps run), so the behavior is observable.

3. replayRollbackRegistry re-runs the user workflow — confirm side-effect safety for uncached steps.

engine.ts:983 calls USER_WORKFLOW.run(...) in replay phase. Cached step.do calls correctly short-circuit to their stored value (context.ts:438-455) and sleep/sleepUntil/waitForEvent return early in replay phase, so a paused/sleeping instance halts replay safely. But if an instance is terminated while a new, uncached step.do would be the next thing the user code reaches before any sleep/wait, that closure would actually execute with real side effects during replay. In practice terminate targets running/paused instances, so this is unlikely — but a comment documenting this assumption (replay relies on all rollback-eligible steps already being cached and execution halting at the next suspension point) would help future maintainers.

4. Swallowed errors in replayRollbackRegistry and userTriggeredTerminate.

engine.ts:987 catches and discards all replay errors, and engine.ts:1015 logs rollback failures only to console.error. That matches the production "best-effort" rollback model, but a malformed user workflow throwing during replay will silently produce zero rollbacks. At minimum, consider a debug-level log so local users can diagnose "why didn't my rollback run?"

Minor / nits

  • engine.ts:316-319 in setRollbackPhase/rollbackPhase use a string-literal union ("replay" | "rollback" | undefined) repeated in several signatures. A small exported type alias would reduce duplication.
  • resources/workflows.ts:923 still casts handle as unknown as WorkflowHandle to call terminate(options). Same TODO as the wrapped-binding.worker.ts cast — worth a single shared note/tracking issue referencing the @cloudflare/workers-types terminate-options gap so both casts get removed together.
  • getEligibleRollbackSteps reads all states rows twice and builds two sets; fine for local volumes, just noting it scans the full log table per terminate.

Tests / changeset

  • Changeset (minor for workflows-shared/wrangler/miniflare) and wording look correct.
  • Wrangler tests now assert the exact request body for both local and remote --rollback, which is good. If point (1) results in gating --rollback to local-only, the remote test should be updated accordingly.

@ask-bonk

ask-bonk Bot commented Jun 30, 2026

Contributor

I reviewed PR #14465 and posted a review. Summary of what I found:

Strengths

  • The replay-to-rebuild-registry approach (replayRollbackRegistry + persisted hasRollback metadata) cleanly solves the empty-registry-after-DO-restart limitation Devin flagged. Cached steps short-circuit and sleep/waitForEvent return early in replay phase, so the design is sound.
  • Tests cover live, replayed, and paused registry cases; changeset and request-body assertions are correct.

Main concern (blocking/discussion)

  1. The remote production API path (utils.ts:131) also sends rollback: true, but the engine implementation is local-only. If the production REST API doesn't support rollback, the CLI still prints "terminated successfully" — a false signal. Suggested gating --rollback to --local, or confirming the remote API is ready.

Non-blocking
2. No user-facing signal whether rollbacks actually ran vs. were silently skipped.
3. replayRollbackRegistry re-runs user code — safe for cached/suspended instances, but worth a comment documenting the assumption that uncached steps won't execute.
4. Swallowed errors in replay/rollback could hide "why didn't my rollback run" — suggest a debug log.

Nits: repeated phase string-union (extract a type), the remaining as unknown as WorkflowHandle cast mirroring the wrapped-binding.worker.ts TODO, and the double full-table scan in getEligibleRollbackSteps.

github run

devin-ai-integration[bot]

This comment was marked as resolved.

@vaishnav-mk

Contributor Author

/bonk review

@ask-bonk ask-bonk Bot left a comment

Contributor


Review: Support workflow terminate rollback (re-review @ dd02374)

Re-reviewing after the two commits added since the last /bonk review (6c313a6 → dd02374). Most of the previously-raised concerns are now resolved. Details below.

Previously-raised items — status

1. Remote (production) API path sends rollback: true — now substantiated. ✅ (with one ask)

The prior review flagged utils.ts:131 as blocking because it was unclear whether the production API accepts rollback on terminate. The linked docs PR (cloudflare-docs#31769) documents the remote CLI form as a supported operation:

npx wrangler workflows instances terminate <WORKFLOW_NAME> <INSTANCE_ID> --rollback

and describes production semantics ("Workflows runs the rollback handlers registered by completed or eligible steps before the instance reaches the terminated state. Steps without rollback handlers are skipped."). That's good evidence the production API is ready and this is intentional, so I no longer consider it blocking. One remaining ask: please confirm the production PATCH .../status endpoint accepting rollback is already deployed (not gated behind the docs merge), since terminate.ts:65 prints unconditional success — if the field is silently dropped server-side today, users get a false "terminated successfully" for a rollback that never ran.

2. Devin's "failed steps won't re-register rollback handlers on replay" — resolved. ✅

dd02374 fixes this directly: the maybeError branch in context.ts:461-475 now calls #registerRollback before re-throwing the cached error, and the replay branch at context.ts:484-497 registers rollback and returns early. New test replays cached failed steps before terminate rollback when registry is empty (engine.test.ts) asserts ROLLBACK_STEP_SUCCESS fires for the failed step. Good coverage.
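
The ordering this fix relies on can be illustrated with a small sketch. It is written under assumptions: `replayCachedStep`, `CachedStep`, and the registry layout are hypothetical, the function is synchronous for brevity, and the real `step.do` signature and `#registerRollback` internals in context.ts may differ. What it shows is that the handler is registered before a cached error is re-thrown, so a failed step still contributes a rollback.

```typescript
type RollbackFn = () => void;

// Hypothetical cache record for a step that ran before terminate was called.
interface CachedStep {
	value?: unknown;
	error?: string;
	hasRollback: boolean;
}

// Hypothetical replay-phase handling of one step; not the engine's code.
function replayCachedStep(
	cache: Map<string, CachedStep>,
	registry: Map<string, RollbackFn>,
	name: string,
	rollback?: RollbackFn
): unknown {
	const cached = cache.get(name);
	if (cached === undefined) {
		// Assumption: uncached steps are not executed while replaying.
		return undefined;
	}
	// Register first, so the throw below cannot skip registration.
	if (cached.hasRollback && rollback !== undefined) {
		registry.set(name, rollback);
	}
	if (cached.error !== undefined) {
		throw new Error(cached.error);
	}
	return cached.value;
}
```

If the registration were placed after the error check instead, a cached failure would leave the registry without an entry for that step, which is the gap described above.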

3. Repeated phase string-union nit — resolved. ✅
RollbackPhase type alias is now used across setRollbackPhase / rollbackPhase (engine.ts:336).

4. Double full-table scan in getEligibleRollbackSteps — resolved. ✅
Now a single SELECT ... ORDER BY id DESC pass building the sets in one loop (engine.ts:256-321).

Still open — non-blocking

5. Swallowed errors in replayRollbackRegistry.
engine.ts:988 catches and discards all replay errors with only a comment; engine.ts:1017 logs rollback failures to console.error. For a local-dev tool, "why didn't my rollback run?" is a real question. A debug-level log in the replay catch (even just the error message) would make silent zero-rollback outcomes diagnosable.

6. No user-facing signal for rollback outcome.
terminate.ts:65 prints plain success regardless of whether any rollback steps executed, were skipped, or failed. Not blocking, but surfacing a count (e.g. "ran N rollback step(s)") would make the feature observable — especially relevant given (5) and the empty-registry edge cases.

Minor / nits

  • resources/workflows.ts:924 still uses handle as unknown as WorkflowHandle to pass terminate options, mirroring the same @cloudflare/workers-types gap. Since the docs PR now types terminate(options?: WorkflowInstanceTerminateOptions), worth a single tracking note so this cast is removed once the types package ships.
  • registerRollbackFn (lib/rollback.ts) now updates stepContext/output in place on re-registration instead of disposing — correct for replay, but confirm the previous dupRollbackStub/disposeRollbackStub lifecycle isn't leaking stubs when a cacheKey re-registers with a fresh fn (the update path keeps existing.fn and drops the new registration.fn without disposing it).

Tests / changeset

  • Changeset (minor for workflows-shared/wrangler/miniflare, patches for dependents) looks correct.
  • Wrangler tests assert exact request bodies for both local and remote --rollback — good.

Overall: the design (persisted hasRollback + replay-to-rebuild-registry) is sound and the newest commit closed the last correctness gap Devin flagged. Approving pending confirmation on point (1) — production API rollback support being live.

@ask-bonk

ask-bonk Bot commented Jul 2, 2026

Contributor

Review posted to PR #14465. Summary of what I found on re-review (head commit dd02374, which added two commits since the previous /bonk review):

Resolved since last review

  • Remote API concern (previously blocking): the linked docs PR documents the remote --rollback CLI form and production semantics, indicating the production API supports it. Downgraded to a non-blocking confirmation ask (is the endpoint live, given terminate.ts prints unconditional success).
  • Devin's failed-step replay concern: fixed in dd02374. context.ts now registers rollback handlers before re-throwing cached errors and in the replay branch, with a new test asserting ROLLBACK_STEP_SUCCESS for failed steps.
  • Phase string-union nit: RollbackPhase type alias now used.
  • Double table-scan nit: now a single SQL pass.

Still open (non-blocking)

  • Swallowed errors in replayRollbackRegistry/userTriggeredTerminate — suggest a debug log.
  • No user-facing signal whether rollbacks ran vs. skipped.

Nits

  • Remaining as unknown as WorkflowHandle cast in resources/workflows.ts:924.
  • registerRollbackFn re-registration path keeps existing.fn and drops the new fn without disposing — worth confirming no stub leak.

Net: sound design, last correctness gap closed; approving pending confirmation on production API readiness.

github run

@vaishnav-mk

Contributor Author

/bonk rereview

@ask-bonk ask-bonk Bot left a comment

Contributor


Review: Support workflow terminate rollback (re-review @ 7b9fa6a)

Re-reviewing after the one commit added since the last /bonk review (dd02374 → 7b9fa6a, "Clarify rollback replay diagnostics"). That commit directly addresses two of the open non-blocking items from last time, so the diff is small and additive. Design remains sound.

Newly resolved since last review

Swallowed replay errors — resolved. ✅
engine.ts:988-991 now captures the replay error and emits console.debug("Rollback replay stopped:", replayErr) instead of silently discarding it. This makes the "why didn't my rollback run?" case diagnosable in local dev, which was my previous item (5). Since workflows-shared is a Worker package (not wrangler), console.debug is the right primitive here — consistent with the existing console.error/console.warn usage in this file.

registerRollbackFn re-registration stub lifecycle — clarified. ✅
lib/rollback.ts:113-114 now documents that the existing entry already owns the duped stub and duplicate registrations refresh stepContext/output only (the helper hasn't duped the incoming fn, so there's nothing to dispose). This resolves the leak concern I raised last time — confirms it's intentional, not an oversight.
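
The re-registration behavior described here can be sketched as below. This is a simplified stand-in, not the helper in lib/rollback.ts: `registerRollbackSketch` and the `stepContext` shape are hypothetical, and the stub duplication the real helper performs on first registration is left out.

```typescript
// Hypothetical registration record; the real stepContext shape may differ.
interface RollbackRegistration {
	fn: () => void;
	stepContext: { attempt: number };
	output: unknown;
}

function registerRollbackSketch(
	registry: Map<string, RollbackRegistration>,
	cacheKey: string,
	registration: RollbackRegistration
): void {
	const existing = registry.get(cacheKey);
	if (existing !== undefined) {
		// The existing entry keeps its fn; only the metadata is refreshed.
		existing.stepContext = registration.stepContext;
		existing.output = registration.output;
		return;
	}
	// First registration for this cache key: store a copy of the record.
	registry.set(cacheKey, { ...registration });
}
```

Because the incoming `fn` is never stored on the update path, nothing new is owned by the registry there, which matches the "nothing to dispose" reasoning above.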

Still open — non-blocking

1. Confirm production PATCH .../status accepts rollback and is deployed.
Still my only real ask. terminate.ts:65 prints unconditional 🥷 ... terminated successfully regardless of whether the remote API honored rollback: true (utils.ts:131). I checked the linked docs PR (cloudflare-docs#31769) — it's still in Draft with no approving reviews yet, so it documents intended semantics but doesn't by itself prove the production endpoint is live today. If the field is silently dropped server-side, remote users get a false success for a rollback that never ran. Please confirm the endpoint is deployed (independent of the docs merge); if it isn't, gate --rollback to --local until it is.

2. Rollback outcome still not surfaced to the user.
Nice to see executeRollbacks now returns { ranAny, allSucceeded } (lib/rollback.ts:172,225) — but both call sites (engine.ts:1016 and engine.ts:1302) discard the result, so it never reaches the CLI, and terminate.ts still prints a flat success. The plumbing is now half there; propagating a count (e.g. "ran N rollback step(s)") would close the observability gap and pairs well with item (1). Non-blocking.
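
As a rough sketch of what surfacing the result could look like, assuming the `{ ranAny, allSucceeded }` shape were propagated to the CLI: `describeTerminate` and the message suffixes are hypothetical, and since that shape carries no step count, the sketch reports a status rather than "ran N rollback step(s)".

```typescript
// Result shape named in the review; everything else here is hypothetical.
interface RollbackOutcome {
	ranAny: boolean;
	allSucceeded: boolean;
}

function describeTerminate(
	instanceId: string,
	outcome?: RollbackOutcome
): string {
	const base = `The instance "${instanceId}" was terminated successfully`;
	if (outcome === undefined) {
		// Terminate was requested without rollback.
		return base;
	}
	if (!outcome.ranAny) {
		return `${base} (no rollback steps ran)`;
	}
	return outcome.allSucceeded
		? `${base} (all rollback steps succeeded)`
		: `${base} (some rollback steps failed)`;
}
```

Reporting an actual count would need `executeRollbacks` to return one, which is beyond what the review describes.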

Minor / nits (unchanged)

  • resources/workflows.ts:924 still uses handle as unknown as WorkflowHandle for terminate(options), mirroring the @cloudflare/workers-types terminate-options gap (same TODO as the restart cast at :919). Worth a single tracking note so both casts are removed once the types package ships.

Tests / changeset

  • No test/changeset changes in this commit; the prior coverage (live / replayed / paused registry, failed-step re-registration asserting ROLLBACK_STEP_SUCCESS, and exact request-body assertions for local + remote --rollback) still stands.
  • Changeset (minor for workflows-shared/wrangler/miniflare, patches for dependents) looks correct.

Overall: the last two commits closed the correctness gap (failed-step replay) and the diagnosability gap (replay debug log). The design (persisted hasRollback + replay-to-rebuild-registry) is solid. Approving pending confirmation on item (1) — production API rollback support being deployed, given the unconditional success message and the still-draft docs PR.

@ask-bonk

ask-bonk Bot commented Jul 2, 2026

Contributor

Re-review posted to PR #14465.

Summary

I re-reviewed at head commit 7b9fa6a, which added one commit ("Clarify rollback replay diagnostics") since my last review at dd02374. The diff was small and additive (+4/-1 across engine.ts and lib/rollback.ts).

Newly resolved:

  • Swallowed replay errors: engine.ts:988-991 now emits console.debug("Rollback replay stopped:", replayErr) instead of silently discarding, making empty-rollback outcomes diagnosable.
  • registerRollbackFn stub lifecycle nit — a clarifying comment at lib/rollback.ts:113-114 confirms the existing entry owns the duped stub and re-registration intentionally refreshes only context/output (no leak).

Still open (non-blocking):

  1. Production API readiness — my one remaining ask. terminate.ts:65 prints unconditional success while utils.ts:131 sends rollback: true to the remote endpoint. I verified the linked docs PR (cloudflare-docs#31769) is still in Draft with no approvals, so it doesn't prove the production endpoint is deployed. Asked to confirm the endpoint is live or gate --rollback to --local.
  2. Rollback outcome not surfaced: executeRollbacks now returns {ranAny, allSucceeded} but both call sites discard it, so the CLI still prints flat success. Plumbing is half there; propagating a count would close the gap.

Minor: the remaining as unknown as WorkflowHandle cast at resources/workflows.ts:924 (types-package gap).

Net: last two commits closed the failed-step replay correctness gap and the diagnosability gap. Design is sound; approving pending confirmation on the production API being deployed.

github run

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor


Devin Review found 1 new potential issue.


Comment on lines +968 to +996
private async replayRollbackRegistry(
	metadata: InstanceMetadata
): Promise<void> {
	if (this.rollbackRegistry.size > 0) {
		return;
	}

	const eligible = this.getEligibleRollbackSteps();
	if (eligible.length === 0) {
		return;
	}

	this.rollbackEligibleCacheKeys = new Set(eligible);
	const stubStep = this.createRollbackContext();
	this.setRollbackPhase("replay");
	try {
		await this.env.USER_WORKFLOW.run(
			metadata.event,
			stubStep as unknown as WorkflowStep
		);
	} catch (replayErr) {
		// Match the production engine: replay may stop on normal workflow control
		// flow; rollback execution uses whatever handlers replay registered.
		console.debug("Rollback replay stopped:", replayErr);
	} finally {
		this.setRollbackPhase(undefined);
		this.rollbackEligibleCacheKeys = undefined;
	}
}

Contributor


🚩 Replay mechanism re-runs the user's workflow function to reconstruct rollback handlers

The replayRollbackRegistry method at packages/workflows-shared/src/engine.ts:968-996 re-invokes this.env.USER_WORKFLOW.run(metadata.event, stubStep) with the rollback phase set to 'replay'. During replay, step.do returns cached values (or undefined for uncached steps) and sleep/sleepUntil are no-ops. This is a powerful but fragile approach: it assumes the workflow's control flow is deterministic and that re-running it with the same event will encounter the same steps in the same order. If the workflow has non-deterministic branching (e.g., based on Date.now() or random values), replay could register rollback handlers for the wrong steps. The rollbackEligibleCacheKeys filter at packages/workflows-shared/src/engine.ts:324-331 mitigates this by only keeping handlers whose cache keys match persisted eligible steps, but a reviewer should consider whether this is sufficient for all workflow patterns.
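
The filtering idea mentioned above can be sketched as follows. `keepEligibleHandlers` is a hypothetical helper written for illustration, and the actual check at engine.ts:324-331 may be structured differently (for example, applied at registration time rather than afterwards).

```typescript
// Keep only handlers whose cache key was persisted as rollback-eligible.
function keepEligibleHandlers<T>(
	registered: Map<string, T>,
	eligibleCacheKeys: ReadonlySet<string>
): Map<string, T> {
	const kept = new Map<string, T>();
	registered.forEach((handler, cacheKey) => {
		if (eligibleCacheKeys.has(cacheKey)) {
			kept.set(cacheKey, handler);
		}
	});
	return kept;
}
```

A filter of this kind can only drop handlers that replay registered for non-persisted keys. It cannot restore a handler for a persisted step that a non-deterministic replay never reached, which is the residual case this comment asks reviewers to weigh.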

